Add learning rate scheduling support for DeepSpeedStrategy
#20320
base: master
Conversation
Thanks for the contribution @amorehead! Let's get to a green CI and take it from there.
Hey @amorehead, looks like the CI failures are legit; let me know if you can fix those.
Thank you @amorehead! I added a few comments. Essentially we need to turn this into a non-breaking change.
Also a small update to docs is needed.
        Currently, only a single optimizer is supported.
        self, module: Module, optimizers: list[Optimizer], scheduler: Optional[_LRScheduler] = None
    ) -> tuple["DeepSpeedEngine", list[Optimizer], Optional[_LRScheduler]]:
This will return None; we need to return Any here so we can ignore the scheduler if it is not provided as input.
Thanks! Addressed in this commit.
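To make the suggestion concrete, here is a hedged sketch of the kind of annotation change being asked for; the class and method names are simplified from the diff context above and the body is elided, so treat it as illustrative rather than the merged code.

from typing import Any, Optional

from torch.nn import Module
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler


class DeepSpeedStrategy:  # simplified stand-in for the real strategy class
    def _setup_module_and_optimizers(
        self,
        module: Module,
        optimizers: list[Optimizer],
        scheduler: Optional[_LRScheduler] = None,
    ) -> tuple["DeepSpeedEngine", list[Optimizer], Any]:
        # Annotating the last tuple slot as Any (rather than Optional[_LRScheduler])
        # lets callers ignore the scheduler entry when none was provided.
        ...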
src/lightning/fabric/fabric.py
@@ -266,7 +269,7 @@ def setup(

         if optimizers:
             # join both types in a tuple for API convenience
-            return (module, *optimizers)
+            return (module, *optimizers, scheduler)
This is a breaking change: it will cause existing user code to fail, because scheduler is returned unconditionally. Since scheduler is Optional in the signature, I suggest we only return it if it was not passed as None, so we won't break anyone's code.
Agreed. Addressed in this commit.
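For reference, the backwards-compatible return agreed on above could look roughly like this; _pack_setup_result is a hypothetical helper used only to illustrate the return shape, not code from the PR.

from typing import Any, Optional


def _pack_setup_result(module: Any, optimizers: list[Any], scheduler: Optional[Any] = None) -> Any:
    """Hypothetical helper mirroring the non-breaking return discussed above."""
    if optimizers:
        # Join module and optimizers in a tuple for API convenience, and append
        # the scheduler only when the caller actually passed one, so existing
        # `model, optimizer = fabric.setup(model, optimizer)` unpacking keeps working.
        if scheduler is not None:
            return (module, *optimizers, scheduler)
        return (module, *optimizers)
    return module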
-        optimizer: Optional[Optimizer] = None,
-    ) -> tuple["DeepSpeedEngine", Optimizer]:
+        self, model: Module, optimizer: Optional[Optimizer] = None, scheduler: Optional[_LRScheduler] = None
+    ) -> tuple["DeepSpeedEngine", Optimizer, Optional[_LRScheduler]]:
Same comment as above
Addressed in this commit.
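For background on where the scheduler ends up: DeepSpeed's own initialize call accepts an lr_scheduler argument and hands it back alongside the engine and optimizer, which is presumably what the extended signature threads through. A minimal sketch follows; it only runs inside a DeepSpeed-launched distributed job, and the config values are placeholders.

import deepspeed
import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# the client scheduler passed in comes back (possibly wrapped) in the last slot.
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    config={"train_batch_size": 8},
)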
@@ -104,7 +104,10 @@ def pl_worker_init_function(worker_id: int, rank: Optional[int] = None) -> None:
     if _NUMPY_AVAILABLE:
         import numpy as np

-        np.random.seed(seed_sequence[3] & 0xFFFFFFFF)  # numpy takes 32-bit seed only
+        ss = np.random.SeedSequence([base_seed, worker_id, global_rank])
This is an unrelated change; it shouldn't be included.
Since this is now merged into master per this previous pull request, is this comment still relevant?
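As a side note on that hunk: the reworked worker seeding mixes the base seed, worker id, and global rank through a numpy SeedSequence instead of truncating a single value to 32 bits. A rough sketch of the pattern, with example values standing in for the function's real arguments:

import numpy as np

base_seed, worker_id, global_rank = 1234, 0, 0  # illustrative values

# SeedSequence mixes all three inputs into well-spread entropy, so every
# (worker, rank) pair gets an independent stream.
ss = np.random.SeedSequence([base_seed, worker_id, global_rank])

# The legacy global RNG accepts an array of 32-bit words as its seed.
np.random.seed(ss.generate_state(4))

# New-style generators can be derived from the same sequence as well.
rng = np.random.default_rng(ss)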
@amorehead I'm wrapping up the last few PRs for the release. Do you have time to fix this one in the next couple of days?
@lantiga, apologies, I'm just now getting to fixing this pull request up. I've updated the …
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@            Coverage Diff            @@
##           master   #20320      +/-  ##
==========================================
+ Coverage      41%      87%     +46%
==========================================
  Files         265      268       +3
  Lines       23394    23455      +61
==========================================
+ Hits         9476    20395   +10919
+ Misses      13918     3060   -10858
@amorehead mind checking the last failing case:
@Borda, I've just fixed this test.
Seems one is left:
@Borda, let's see if this latest commit of mine fixes it.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.
@Borda, may I ask you to check the "Read the Docs" tests and why they are failing?
What does this PR do?
Adds learning rate scheduling support for DeepSpeedStrategy.
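With this in place, the intended Fabric-level usage would look roughly like the sketch below; the exact setup() signature is an assumption based on the diffs above, so treat it as illustrative.

import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=1, strategy="deepspeed")
fabric.launch()

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# With the proposed change, setup() can also take the scheduler and return it
# (only when one is passed, per the non-breaking behaviour discussed above).
model, optimizer, scheduler = fabric.setup(model, optimizer, scheduler=scheduler)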
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20320.org.readthedocs.build/en/20320/